DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
Zhu, Dawei, Meng, Rui, Chen, Jiefeng, Li, Sujian, Pfister, Tomas, Yoon, Jinsung
Comprehending long visual documents, where information is distributed across many pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, which limits performance and leads to model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's advantage is particularly evident on vision-centric and unanswerable queries, demonstrating the value of its improved localization capabilities.
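The abstract does not say how adjudication works. The following is a rough illustration of the general sample-then-adjudicate pattern. Several candidate answers are drawn and then reduced to one. The function name `sample_then_adjudicate`, the `min_agreement` threshold, and the use of plain majority voting as the adjudicator are hypothetical stand-ins, not DocLens's actual mechanism. A sampler that returns `None` models a sample judging the question unanswerable.

```python
from collections import Counter
from typing import Callable, Optional


def sample_then_adjudicate(
    sample_answer: Callable[[int], Optional[str]],
    num_samples: int = 5,
    min_agreement: float = 0.5,
) -> Optional[str]:
    """Draw several candidate answers and reduce them to a single answer.

    Hypothetical stand-in: majority voting plays the adjudicator's role.
    `sample_answer(i)` returns the i-th sampled answer, or None when that
    sample judged the question unanswerable.
    """
    # Sampling stage: collect independent candidates.
    candidates = [sample_answer(i) for i in range(num_samples)]
    # Light normalization so "Yes" and " yes" count as the same answer.
    answered = [c.strip().casefold() for c in candidates if c is not None]
    if not answered:
        return None  # every sample abstained
    # Adjudication stage (here: majority vote over non-abstaining samples).
    answer, votes = Counter(answered).most_common(1)[0]
    # Abstentions count against agreement, so weak consensus yields None.
    if votes / num_samples < min_agreement:
        return None
    return answer


# Usage with a fake sampler; in practice each call would query a VLM.
fake_samples = ["42", "42", "41", "42 ", None]
print(sample_then_adjudicate(lambda i: fake_samples[i]))  # prints 42
```

Returning `None` when agreement is weak is one simple way to let disagreement among samples surface as "unanswerable" rather than forcing a guess. A learned or VLM-based adjudicator could replace the vote without changing the surrounding structure.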
d81f9c1be2e08964bf9f24b15f0e4900-Reviews.html
This paper proposes a neural network architecture that falls somewhere between multilayer perceptrons (MLPs) and sigmoid belief networks (SBNs). The motivation is to permit multimodal predictive distributions (like SBNs) by using stochastic hidden units, while adding deterministic hidden units to smooth the predictive distribution in the case of real-valued data. The paper's main technical contribution is an EM-style algorithm in which the E-step uses importance sampling to approximate the posterior and the M-step uses backpropagation to update the parameters. The experiments demonstrate the model's utility on several synthetic and real datasets. Quality: I liked this paper; the use of stochastic and deterministic units seems reasonably justified.
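To make the importance-sampling E-step concrete, here is a minimal sketch of self-normalized importance sampling over stochastic binary hidden units. The toy model is an assumption for illustration and is not taken from the paper. It has one layer of Bernoulli hidden units, a Gaussian output, no deterministic units, and uses the prior p(h | x) as the proposal. All names (`importance_e_step`, `W`, `V`, `sigma`) are hypothetical.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def importance_e_step(x, y, W, V, sigma, num_samples, rng):
    """Approximate p(h | x, y) with self-normalized importance sampling.

    Assumed toy model (not the paper's exact architecture):
      h_j ~ Bernoulli(sigmoid(W @ x)_j)   stochastic binary hidden units
      y   ~ Normal(V @ h, sigma^2 I)      real-valued output
    The prior p(h | x) serves as the proposal, so each sample's weight is
    proportional to the likelihood p(y | h).
    """
    probs = sigmoid(W @ x)                                             # (H,)
    h = (rng.random((num_samples, probs.size)) < probs).astype(float)  # (S, H)
    means = h @ V.T                                                    # (S, D)
    # Gaussian log-likelihood up to an additive constant shared by all samples.
    log_lik = -0.5 * np.sum((y - means) ** 2, axis=1) / sigma**2
    # Subtract the max before exponentiating for numerical stability.
    w = np.exp(log_lik - log_lik.max())
    return h, w / w.sum()


rng = np.random.default_rng(0)
x = np.array([1.0, -1.0])
W = rng.normal(size=(3, 2))   # hidden-by-input weights (made up)
V = rng.normal(size=(2, 3))   # output-by-hidden weights (made up)
y = np.array([0.5, -0.2])
h, w = importance_e_step(x, y, W, V, sigma=0.5, num_samples=200, rng=rng)
posterior_mean = w @ h        # weighted estimate of E[h | x, y]
```

In an M-step of the kind the review describes, the weights `w` would be held fixed and used to scale each sample's log-likelihood gradient during backpropagation. That step is omitted here.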
Measuring Variety, Balance, and Disparity: An Analysis of Media Coverage of the 2021 German Federal Election
Färber, Michael, Schwade, Jannik, Jatowt, Adam
Determining and measuring diversity in news articles is important for a number of reasons, including preventing filter bubbles and fueling public discourse, especially before elections. So far, diversity has been identified and analyzed in a variety of ways, such as measuring the overlap of words or topics between news articles related to US elections. However, the question of how diversity in news articles can be measured holistically, i.e., with respect to (1) variety, (2) balance, and (3) disparity, considering individuals, parties, and topics, has not been addressed. In this paper, we present a framework for determining diversity in news articles according to these dimensions. Furthermore, we create and provide a dataset of Google Top Stories, encompassing more than 26,000 unique headlines from more than 900 news outlets collected within two weeks before and after the 2021 German federal election. While we observe high diversity for more general search terms (e.g., "election"), a range of search terms ("education," "Europe," "climate protection," "government") yielded news articles with high diversity in only two of the three dimensions. This reflects a more subjective, focused discussion of largely future-oriented topics.
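The abstract names the three dimensions but not their formulas. One common way to operationalize the variety/balance/disparity triad is drawn from Stirling's general diversity framework and is assumed here rather than taken from the paper. Under it, variety is the number of distinct categories, balance is normalized Shannon entropy (Pielou evenness), and disparity is the mean pairwise distance between the categories present. The party counts and the left-right positions used as a distance are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Callable, Mapping


def variety(counts: Mapping[str, int]) -> int:
    """Number of distinct categories that actually occur."""
    return sum(1 for c in counts.values() if c > 0)


def balance(counts: Mapping[str, int]) -> float:
    """Pielou evenness: Shannon entropy divided by its maximum, log(variety)."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    if len(probs) < 2:
        return 0.0  # convention chosen here: a single category has no balance
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))


def disparity(counts: Mapping[str, int], distance: Callable[[str, str], float]) -> float:
    """Mean pairwise distance between the categories that occur."""
    present = [k for k, c in counts.items() if c > 0]
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical example: party mentions in one set of headlines.
mentions = Counter(["CDU", "SPD", "SPD", "Greens"])
# Hypothetical left-right positions on [0, 1], used only to define a distance.
position = {"CDU": 0.6, "SPD": 0.4, "Greens": 0.3}


def dist(a: str, b: str) -> float:
    return abs(position[a] - position[b])


print(variety(mentions))                    # 3
print(round(balance(mentions), 3))          # 0.946
print(round(disparity(mentions, dist), 3))  # 0.2
```

Keeping the three measures separate mirrors the abstract's point that a set of articles can score high on two dimensions while scoring lower on the third.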
Comparatives, Quantifiers, Proportions: A Multi-Task Model for the Learning of Quantities from Vision
Pezzelle, Sandro, Sorodoc, Ionut-Teodor, Bernardi, Raffaella
The present work investigates whether different quantification mechanisms (set comparison, vague quantification, and proportional estimation) can be jointly learned from visual scenes by a multi-task computational model. The motivation is that, in humans, these processes rely on the same cognitive, non-symbolic ability, which allows an automatic estimation and comparison of set magnitudes. We show that when information about lower-complexity tasks is available, the higher-level proportional task becomes more accurate than when performed in isolation. Moreover, the multi-task model is able to generalize to unseen combinations of target/non-target objects. Consistent with behavioral evidence showing the interference of absolute number in the proportional task, the multi-task model no longer works when asked to provide the number of target objects in the scene.
Automated Driving: How will it affect me?
The author enrolled in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, which ran from July 5th to September 23rd, 2016. This post is based on their second project (R Shiny), due in the 4th week of the program. Google, Tesla, and established automakers such as BMW, Daimler-Mercedes, and General Motors are all presenting visions of a future where most or all of the responsibilities and tasks of driving are no longer yours. A major anticipated benefit of automated driving is a reduction in fatal motor vehicle accidents.
Machine Learning for Media Monitoring - with Signal Chief Data Scientist
Episode Summary: One facet of business that nearly every industry shares is the need to stay on top of news in its market, including tracking competitor strategies and understanding changes in news related to the field. Media monitoring is a domain that machine learning (ML) is well suited for, given its ability to extract headlines, contextual information, and financial data from the seemingly endless stream of social media posts, blogs, and other information on the web today. Signal is a company that uses ML specifically for these purposes. In this episode, we speak with Signal's Chief Data Scientist and Co-founder Dr. Miguel Martinez, who dives into real business use cases illustrating the use of machine learning for media monitoring across industries. Brief Recognition: Dr. Miguel Martinez is chief data scientist at Signal, where he manages a team of data scientists who turn the best algorithms from the fields of machine learning, information retrieval, and natural language processing into large-scale commercial products.